NMS-19647: Fix out-of-order sample loss by making remote write synchronous by cnewkirk · Pull Request #109 · OpenNMS/opennms-cortex-tss-plugin

cnewkirk · 2026-03-16T03:46:19Z

The store() method previously fired HTTP writes asynchronously via executeAsync() and returned immediately. When the RingBuffer's multiple worker threads dispatched consecutive batches containing samples for the same series, the async HTTP requests could arrive at the remote write endpoint out of timestamp order, causing the backend to reject the stale samples as out-of-order.

This change makes store() block until the HTTP write completes, ensuring the ring buffer worker thread does not process the next batch until the current write has landed. This preserves per-series timestamp ordering across consecutive WriteRequests as required by the Prometheus Remote Write spec.
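
As a rough illustration of the blocking behaviour described above (not the plugin's actual code), the sketch below waits on the future returned by an async send before the caller moves on. The class `SyncWriteSketch`, the methods `sendAsync` and `storeBlocking`, the simulated latencies, and the 5 second timeout are all hypothetical stand-ins chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SyncWriteSketch {
    // Hypothetical stand-in for the remote write endpoint: records the order in which batches land.
    static final List<Long> arrivals = Collections.synchronizedList(new ArrayList<>());

    // Daemon threads so the simulated HTTP pool never keeps the JVM alive.
    static final ExecutorService httpPool = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Hypothetical async send: sleeps for latencyMs to mimic network delay, then "lands" the batch.
    static CompletableFuture<Void> sendAsync(long batchTimestamp, long latencyMs) {
        return CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(latencyMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            arrivals.add(batchTimestamp);
        }, httpPool);
    }

    // Blocking variant: does not return until the write has completed (or the assumed timeout expires).
    static void storeBlocking(long batchTimestamp, long latencyMs) throws Exception {
        sendAsync(batchTimestamp, latencyMs).get(5, TimeUnit.SECONDS);
    }

    // Older batches get the highest latency; fired without waiting, they could be overtaken.
    static List<Long> runBlocking() throws Exception {
        arrivals.clear();
        storeBlocking(1L, 60);
        storeBlocking(2L, 30);
        storeBlocking(3L, 0);
        return new ArrayList<>(arrivals);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runBlocking()); // prints [1, 2, 3]
    }
}
```

Because each call waits for its own write, a single caller cannot have batch 2 land before batch 1, even though batch 1 is the slowest. The cost is that the caller is held for the full round trip of every write.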

Additionally fixes a bug where samplesLost incorrectly counted unfiltered samples (including NaN) instead of the actual samples that were attempted.
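
A minimal sketch of the counting fix, assuming the lost-sample metric should reflect only samples that survived filtering and were actually sent. The class `SamplesLostSketch`, the helpers `dropNaN` and `recordFailedWrite`, and the use of a plain `AtomicLong` are hypothetical; the plugin's real types and metric wiring may differ.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

public class SamplesLostSketch {
    // Hypothetical counter standing in for the samplesLost metric.
    static final AtomicLong samplesLost = new AtomicLong();

    // Drop NaN values, which are never sent to the remote write endpoint in this sketch.
    static List<Double> dropNaN(List<Double> values) {
        return values.stream()
                .filter(v -> !v.isNaN())
                .collect(Collectors.toList());
    }

    // On a failed write, count only the samples that were attempted, not the raw input size.
    static void recordFailedWrite(List<Double> rawValues) {
        List<Double> attempted = dropNaN(rawValues);
        samplesLost.addAndGet(attempted.size());
    }

    public static void main(String[] args) {
        // Four raw samples, two of them NaN: only two were ever attempted.
        recordFailedWrite(List.of(1.0, Double.NaN, 2.5, Double.NaN));
        System.out.println(samplesLost.get()); // prints 2
    }
}
```

Counting `rawValues.size()` here would report 4 lost samples, overstating the loss by the number of NaN values that were filtered out before the write.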

Validated via A/B E2E testing against Thanos Receive:

- Baseline (async): 14 out-of-order / 5,262 appended (0.27% loss)
- Fix (sync): 0 out-of-order / 5,264 appended (0.00% loss)
- Throughput: identical (~5,260 samples over equal soak periods)
- All 45 smoke tests passing on both runs

marshallmassengill · 2026-03-30T12:35:39Z

Created https://opennms.atlassian.net/browse/NMS-19647 for this one.

The store() method previously fired HTTP writes asynchronously via executeAsync() and returned immediately. When the RingBuffer's multiple worker threads dispatched consecutive batches containing samples for the same series, the async HTTP requests could arrive at the remote write endpoint out of timestamp order, causing the backend to reject the stale samples as out-of-order.

This change makes store() block until the HTTP write completes, ensuring the ring buffer worker thread does not process the next batch until the current write has landed. This preserves per-series timestamp ordering across consecutive WriteRequests as required by the Prometheus Remote Write spec.

Additionally fixes a bug where samplesLost incorrectly counted unfiltered samples (including NaN) instead of the actual samples that were attempted.

Validated via A/B E2E testing against Thanos Receive:

- Baseline (async): 14 out-of-order / 5,262 appended (0.27% loss)
- Fix (sync): 0 out-of-order / 5,264 appended (0.00% loss)
- Throughput: identical (~5,260 samples over equal soak periods)
- All 45 smoke tests passing on both runs

Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>

marshallmassengill changed the title ~~Fix out-of-order sample loss by making remote write synchronous~~ NMS-19647: Fix out-of-order sample loss by making remote write synchronous Apr 27, 2026

cnewkirk force-pushed the bugfix/improve-prom-writespec-compliance branch from c7f4ba6 to ea8d098 on April 27, 2026 16:04

cnewkirk force-pushed the bugfix/improve-prom-writespec-compliance branch from ea8d098 to b6e6b75 on April 27, 2026 16:05

marshallmassengill requested review from cgorantla and marshallmassengill April 28, 2026 20:07

cnewkirk wants to merge 1 commit into OpenNMS:master from cnewkirk:bugfix/improve-prom-writespec-compliance
